This report was prepared for ETC5521 Assignment 2 by Team Cassowary, comprising Sahinya Akila and Xinrui Wang.

1 Introduction and Motivation

Employment and earnings are among the most frequently discussed topics, and gender and race are often raised in discussions of workplace fairness. Fekedulegn et al. (2019) reported that workplace discrimination and mistreatment vary significantly by race and gender in the US. This report follows up on that finding with a detailed analysis of employment and earnings across industries in the US, asking whether the data support it and how strongly gender and race affect employment and earnings.

The data used in this report come from the TidyTuesday project. The report examines employment status and earnings from 2010 to 2020 across races, genders and age groups in various US industries, with the aim of informing efforts to promote fairness, equality and diversity in the workplace.

The analysis and conclusions in this report are based solely on the datasets described in the Data Description section, and all records in those datasets are assumed to be accurate. Furthermore, because the datasets contain no regional information and cover different time frames, the findings may be subject to bias.

2 Data Description

The datasets originally come from the US Bureau of Labor Statistics (BLS), specifically table cpsaat17 across several years.

The employed dataset describes employed persons by industry, sex, race and occupation from 2015 to 2020.

| Variable | Data type | Description |
|---|---|---|
| industry | character | Industry group |
| major_occupation | character | Major occupation category |
| minor_occupation | character | Minor occupation category |
| race_gender | character | Race and gender group |
| industry_total | double | Industry total count |
| employ_n | double | Number of people employed |
| year | double | Year |

The earn dataset describes median weekly earnings and the number of persons employed by race, gender and age group from 2010 to 2020.

| Variable | Data type | Description |
|---|---|---|
| sex | character | Gender |
| race | character | Racial group |
| ethnic_origin | character | Ethnic origin (Hispanic or non-Hispanic) |
| age | character | Age group |
| year | double | Year |
| quarter | double | Quarter |
| n_persons | double | Number of persons employed by group |
| median_weekly_earn | double | Median weekly earnings in current dollars |

The datasets are collected from the Current Population Survey (CPS), a monthly survey of households conducted by the Bureau of the Census for the Bureau of Labor Statistics.

Here are some findings when looking through the methods used to tidy and wrangle data from the original source:

Based on these datasets, the following section explores five questions, one in each of Sections 3.1 to 3.5.

3 Analysis and Findings

3.1 What are the changes of people employed in different industries from 2015 to 2020?

Figure 3.1: Number of people employed across industry from 2015 to 2020

Figure 3.1 shows how the number of people employed in each industry changed from 2015 to 2020. Education and health services employed the most people, and the number stayed stable at between 34 million and 35 million over the period. Private households employed the fewest people, and the number fell from around 0.7 million to around 0.6 million. In addition, employment fell from 2019 to 2020 in every industry except public administration.
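To put the private households figures on a relative scale, the change can be expressed as a percentage. The sketch below is illustrative only: it uses the approximate values read from the text (0.7 and 0.6 million), not the exact counts in the dataset, and the function name `percent_change` is a hypothetical helper.

```python
def percent_change(start, end):
    """Relative change from start to end, in percent."""
    if start == 0:
        raise ValueError("start must be non-zero")
    return (end - start) / start * 100.0

# Approximate figures quoted in the text for private households
# (millions of people); the exact dataset values may differ.
private_households_2015 = 0.7
private_households_2020 = 0.6

change = percent_change(private_households_2015, private_households_2020)
print(f"Private households, 2015 to 2020: {change:.1f}%")
```

On these rounded figures the decline is roughly 14%, which is large in relative terms even though the absolute change is small.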

3.2 What are the demographic differences between industries from 2015 to 2020?

3.2.1 Gender

Figure 3.2: Distribution of men and women across industries

Only five industries have more female than male employees: education and health services, financial activities, leisure and hospitality, other services, and private households (Figure 3.2). In education and health services in particular, female employees outnumber male employees by more than two to one. By contrast, men hold most of the roles in industries such as manufacturing, construction, transportation and utilities, and durable goods. In construction, more than 90% of employees are male.
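A comparison of this kind can be computed as the share of women among men and women in each industry. The sketch below is a minimal illustration under stated assumptions: the group labels "Men" and "Women" are assumed values of the race_gender column, the industries and counts are hypothetical, and `women_share_by_industry` is not the code used to produce Figure 3.2.

```python
from collections import defaultdict

def women_share_by_industry(rows):
    """Return {industry: women / (men + women)} from (industry, group, count) rows."""
    totals = defaultdict(lambda: {"Men": 0.0, "Women": 0.0})
    for industry, group, count in rows:
        if group in ("Men", "Women"):  # skip race rows and totals
            totals[industry][group] += count
    return {
        industry: counts["Women"] / (counts["Men"] + counts["Women"])
        for industry, counts in totals.items()
        if counts["Men"] + counts["Women"] > 0
    }

# Hypothetical counts, NOT taken from the BLS data.
rows = [
    ("Industry A", "Men", 100.0),
    ("Industry A", "Women", 200.0),
    ("Industry A", "White", 250.0),  # ignored: not a gender row
    ("Industry B", "Men", 900.0),
    ("Industry B", "Women", 100.0),
]

shares = women_share_by_industry(rows)
for industry, share in sorted(shares.items()):
    print(f"{industry}: {share:.1%} women")
```

A two-to-one ratio of women to men corresponds to a share of about 67% women, and a 90% male workforce corresponds to a 10% share.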

3.2.2 Race

Figure 3.3: Distribution of different races across industries

According to Figure 3.3, most of the people employed in every industry are White, followed by Black or African American and then Asian.

3.3 At what age do men and women work the most and how does the age factor contribute towards employment?

Figure 3.4: Employment rate by gender and age group

Figure 3.4 shows that both men and women aged 16 to 54 are employed in greater numbers than those in older age groups, and that more men than women are employed. Employment peaks in the 25-54 age group, the period when people have finished their education and started their careers, and the prime working years for most people. As one would expect, the numbers decline after age 55 as people move into retirement.

3.4 How do different factors affect the income between 2010 and 2020?

Looking at the earnings data, median weekly income varies across genders, races and age groups.

3.4.1 Gender & Race

Figure 3.5: Race and gender both play a significant role in income

Figure 3.5 indicates that both gender and race played a significant role in weekly income over the past ten years. Income shows a clear upward trend over the period. The upper vertex of each segment represents men's income and the lower vertex represents women's, so the plot shows that men earned more than women in every year and every race from 2010 to 2020. The plot also suggests that race is a key factor. A surprising finding is that, although the number of Asians employed is very low across all industries (Section 3.2.2), Asians have the highest median weekly income of the three races recorded, followed by White workers, while Black or African American workers earn the least. The data used here cannot explain this gap, so the following explanations are conjectures only. One possibility is a difference in hours worked, with Asian workers more likely to work extra hours than the other two groups. Another is a difference in occupations: Asian workers may be more concentrated in high-income occupations that require advanced education, such as medicine and law, while Black or African American workers may face racial discrimination that pushes them into lower-paid jobs.

3.4.2 Age group

Figure 3.6: Median weekly income by year and age group

Figure 3.6 shows that income grew over the years in every age group. The Y-axis is divided at the minimum, first quartile, median, third quartile and maximum of the median weekly income across all records. The interactive plot shows that young adults earn much less than middle-aged people and that there is little difference between age groups over 35. This is consistent with common sense. People aged 16-24 are most likely school leavers or full-time students working part-time, so their income is lower given the number of hours they can work each week and the skill level of the positions available to them. People aged 25-34 are more likely to be at an early stage of their career in entry-level positions, which generally pay more than part-time work but less than the senior positions held by the majority of those aged 35 and over.

3.5 How significantly do gender and race affect earnings?

The findings above show that median weekly income varies across gender and race. This section explores how strongly each factor affects earnings and which plays the larger role in median weekly income in the US.

Figure 3.7: Distribution of median weekly income by gender and race

Figure 3.7 compares the distribution of median weekly income for men and women with the overall distribution (the boxes without colour). Women have a lower median weekly income than both men and the overall level in all three races. The figure also confirms the finding from Section 3.4.1 that Asians have the highest median weekly income and Black or African Americans the lowest. Furthermore, the distribution is wider for men than for women, which suggests that the gap between high and low median weekly incomes is larger among men. These findings again indicate that both gender and race are significant factors in earnings. However, the plot alone cannot show how much each factor affects median weekly income, or which is the more significant.

A model is therefore introduced. Before fitting it, an important point to consider is that the earnings data form a time series: median weekly income grows over the years for every variable of interest, so the assumption of independent, random observations is violated. The best available remedy under these circumstances is to treat year as an additional categorical variable and include it in the model.
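Treating year as categorical means replacing the single numeric year with a set of 0/1 indicator columns, one per year except a reference year that is absorbed by the intercept. The sketch below illustrates this coding in plain Python. It assumes 2010 is the reference level, which is consistent with Table 3.1 listing terms for 2011 to 2020 only, and `dummy_code` is a hypothetical helper rather than the function used in the report.

```python
def dummy_code(value, levels):
    """Treatment (dummy) coding: one 0/1 indicator per level except the first,
    which acts as the reference level absorbed by the intercept."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

years = list(range(2010, 2021))  # 2010 is the assumed reference level

row_2010 = dummy_code(2010, years)  # all zeros: the baseline year
row_2013 = dummy_code(2013, years)  # a single 1 in the 2013 column

print(row_2010)
print(row_2013)
```

With this coding each year coefficient is read as the difference in median weekly income between that year and the reference year, with the other variables held fixed.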

Table 3.1: Regression model fitted to median weekly income by gender, race and year

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 909.3758 | 17.5836 | 51.7174 | 0.0000 |
| sexWomen | -150.8788 | 9.3988 | -16.0530 | 0.0000 |
| raceBlack or African American | -278.7114 | 11.5111 | -24.2123 | 0.0000 |
| raceWhite | -110.8727 | 11.5111 | -9.6318 | 0.0000 |
| year2011 | 8.1333 | 22.0421 | 0.3690 | 0.7122 |
| year2012 | 30.0000 | 22.0421 | 1.3610 | 0.1737 |
| year2013 | 41.6000 | 22.0421 | 1.8873 | 0.0593 |
| year2014 | 58.8917 | 22.0421 | 2.6718 | 0.0076 |
| year2015 | 79.0250 | 22.0421 | 3.5852 | 0.0003 |
| year2016 | 104.3083 | 22.0421 | 4.7322 | 0.0000 |
| year2017 | 125.5167 | 22.0421 | 5.6944 | 0.0000 |
| year2018 | 148.7833 | 22.0421 | 6.7499 | 0.0000 |
| year2019 | 200.6333 | 22.0421 | 9.1023 | 0.0000 |
| year2020 | 271.8833 | 22.0421 | 12.3347 | 0.0000 |

A linear regression model was fitted, as shown in Table 3.1. The p-values for Women, Black or African American and White are all extremely close to 0, which indicates that these terms are significant in the model. The estimated year coefficients are all positive and increase steadily from 2011 to 2020, which agrees with the earlier finding that median weekly income rose over the period. The fitted model can be written as:

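\[\text{MedianWeeklyEarn} = 909.3758 - 150.8788 \times \text{Women} - 278.7114 \times \text{BlackOrAfricanAmerican} - 110.8727 \times \text{White} + 8.1333 \times \text{Year2011} + \dots + 271.8833 \times \text{Year2020}\]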

According to the model, with race and year held fixed, the median weekly income of women is 150.8788 dollars lower than that of men. With sex and year held fixed, the median weekly incomes of Black or African American and White workers are 278.7114 and 110.8727 dollars lower than that of Asian workers, respectively. Therefore, among the variables of interest, being Black or African American has the largest effect on median weekly income, followed by being a woman and being White.
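As a worked example of reading the model, the fitted value for any group is the intercept plus the coefficients of the terms that apply to it. The sketch below copies the coefficients from Table 3.1 and assumes that Men, Asian and 2010 are the reference levels, since none of them has a term in the table. The helper `predict` is hypothetical and only reproduces the arithmetic of the fitted equation.

```python
# Coefficients copied from Table 3.1.
COEFFICIENTS = {
    "(Intercept)": 909.3758,
    "sexWomen": -150.8788,
    "raceBlack or African American": -278.7114,
    "raceWhite": -110.8727,
    "year2011": 8.1333,
    "year2012": 30.0000,
    "year2013": 41.6000,
    "year2014": 58.8917,
    "year2015": 79.0250,
    "year2016": 104.3083,
    "year2017": 125.5167,
    "year2018": 148.7833,
    "year2019": 200.6333,
    "year2020": 271.8833,
}

# Assumed reference levels: they have no term in Table 3.1.
REFERENCE_LEVELS = {"sexMen", "raceAsian", "year2010"}

def predict(sex, race, year):
    """Fitted median weekly earnings (dollars) for one sex, race and year."""
    total = COEFFICIENTS["(Intercept)"]
    for term in (f"sex{sex}", f"race{race}", f"year{year}"):
        if term in REFERENCE_LEVELS:
            continue  # reference level contributes nothing beyond the intercept
        total += COEFFICIENTS[term]  # raises KeyError for an unknown level
    return total

baseline = predict("Men", "Asian", 2010)
example = predict("Women", "Black or African American", 2020)

print(f"Asian men, 2010: {baseline:.4f}")
print(f"Black or African American women, 2020: {example:.4f}")
```

The baseline group returns the intercept, 909.3758 dollars, and the second case works out to 909.3758 - 150.8788 - 278.7114 + 271.8833 = 751.6689 dollars. Because the model has no interaction terms, the gender gap is the same fitted amount in every race and year.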

Regression diagnostics were also run to assess goodness of fit. Overall, the fitted model explains part of the variation in the data, but it could be improved by introducing additional datasets and potentially more variables.

Figure 3.8: Diagnostics for the fitted model

The residual plot in Figure 3.8 shows clear discreteness. This is caused mainly by the nature of the independent variables: all of them are categorical, i.e. discrete, so discreteness in the residual plot is not surprising. In addition, both R squared and adjusted R squared are below 0.5, as shown in Table 3.2, which means that only about 47% of the variation in the data (46% after adjustment) is explained by the fitted model.

Table 3.2: Goodness of fit

| r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|
| 0.4675 | 0.4622 | 17331.86 | 17409.64 |
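For reference, the two measures in Table 3.2 are related by the standard definition of adjusted R squared, shown below. Here n denotes the number of observations and p the number of predictors excluding the intercept (13 in Table 3.1). The value of n is not reported in this document, so the symbols are left general.

\[\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}\]

The adjustment penalises each additional predictor, which is why the adjusted value (0.4622) is slightly lower than R squared (0.4675). The small difference suggests that the number of observations is large relative to the 13 predictors.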

Based on the findings above, race, specifically being Black or African American, had the largest effect on median weekly earnings in the US from 2010 to 2020, followed by gender and being White. However, the model could be improved by adding more datasets and new variables such as industry and education level, and the findings may change if a new model is fitted.

4 Key Findings and Limitations

Industries that generally require more physical labour and technical skills are overwhelmingly dominated by men, whereas industries with more women arguably call for more patience and care. In terms of earnings, men generally earn more than women, and younger age groups tend to earn less than those aged 35 and over. An interesting finding emerges for race: although Asians make up a very low proportion of the people employed, they have a higher median weekly income than White people, who are employed in far greater numbers across industries. Black or African Americans earn the least and are also employed in very low numbers across industries. The regression model supports these findings and shows that being Black or African American is the variable most strongly associated with a lower median weekly income, followed by being a woman and being White. These findings suggest that gender and race play a significant role in both employment and earnings across industries in the US, and that discrimination and mistreatment in the workforce could be a concerning issue.

In terms of limitations, the two datasets cover different periods: the employment dataset runs from 2015 to 2020, whereas the earnings dataset runs from 2010 to 2020. This mismatch could lead to biased findings and conclusions. Future studies could also examine age group and ethnic origin in more depth, using additional datasets from other sources with a longer timeline, in order to draw more precise conclusions.

5 References

Fekedulegn, D., Alterman, T., Charles, L., Kershaw, K., Safford, M., Howard, V., MacDonald, L. (2019). Prevalence of workplace discrimination and mistreatment in a national sample of older U.S. workers: The REGARDS cohort study. SSM - Population Health, Volume 8, 100444, ISSN 2352-8273, https://doi.org/10.1016/j.ssmph.2019.100444.

5.1 Data source

Labor Force Statistics from the Current Population Survey. (2021). Retrieved 15 August 2021, from https://www.bls.gov/cps/tables.htm#charemp_m

Tidytuesday. (2021). Retrieved 15 August 2021, from https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-02-23/readme.md

5.2 Software

R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

5.3 Packages

David Robinson, Alex Hayes and Simon Couch (2021). broom: Convert Statistical Objects into Tidy Tibbles. R package version 0.7.9. https://CRAN.R-project.org/package=broom

Hadley Wickham, Romain François, Lionel Henry and Kirill Müller (2021). dplyr: A Grammar of Data Manipulation. R package version 1.0.5. https://CRAN.R-project.org/package=dplyr

Hadley Wickham (2021). tidyr: Tidy Messy Data. R package version 1.1.3. https://CRAN.R-project.org/package=tidyr

Hao Zhu (2021). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra

Jeroen Ooms (2021). magick: Advanced Graphics and Image-Processing in R. R package version 2.7.3. https://CRAN.R-project.org/package=magick

JJ Allaire and Yihui Xie and Jonathan McPherson and Javier Luraschi and Kevin Ushey and Aron Atkins and Hadley Wickham and Joe Cheng and Winston Chang and Richard Iannone (2021). rmarkdown: Dynamic Documents for R. R package version 2.10. URL https://rmarkdown.rstudio.com.

Katherine Goode and Kathleen Rey (2019). ggResidpanel: Panels and Interactive Versions of Diagnostic Plots using ‘ggplot2’. R package version 0.3.0. https://CRAN.R-project.org/package=ggResidpanel

Kirill Müller and Hadley Wickham (2021). tibble: Simple Data Frames. R package version 3.1.3. https://CRAN.R-project.org/package=tibble

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Wilke, C.O. (2020). ggtext: Improved Text Rendering Support for ‘ggplot2’. R package version 0.1.1. https://CRAN.R-project.org/package=ggtext

Yihui Xie and J.J. Allaire and Garrett Grolemund (2018). R Markdown: The Definitive Guide. Chapman and Hall/CRC. ISBN 9781138359338. URL https://bookdown.org/yihui/rmarkdown.

Yihui Xie and Christophe Dervieux and Emily Riederer (2020). R Markdown Cookbook. Chapman and Hall/CRC. ISBN 9780367563837. URL https://bookdown.org/yihui/rmarkdown-cookbook.

Zeileis A, Hornik K, Murrell P (2009). “Escaping RGBland: Selecting Colors for Statistical Graphics.” Computational Statistics & Data Analysis, 53(9), 3259-3270. doi: 10.1016/j.csda.2008.11.033 (URL: https://doi.org/10.1016/j.csda.2008.11.033)